Chinese Part-of-Speech Tagging: One-at-a-Time or All-at-Once? Word-Based or Character-Based?

نویسندگان

  • Hwee Tou Ng
  • Jin Kiat Low
چکیده

Chinese part-of-speech (POS) tagging assigns one POS tag to each word in a Chinese sentence. However, since words are not demarcated in a Chinese sentence, Chinese POS tagging requires word segmentation as a prerequisite. We could perform Chinese POS tagging strictly after word segmentation (one-at-a-time approach), or perform both word segmentation and POS tagging in a combined, single step simultaneously (all-atonce approach). Also, we could choose to assign POS tags on a word-by-word basis, making use of word features in the surrounding context (word-based), or on a character-by-character basis with character features (character-based). This paper presents an in-depth study on such issues of processing architecture and feature representation for Chinese POS tagging, within a maximum entropy framework. We found that while the all-at-once, characterbased approach is the best, the one-at-a-time, character-based approach is a worthwhile compromise, performing only slightly worse in terms of accuracy, but taking shorter time to train and run. As part of our investigation, we also built a state-of-the-art Chinese word segmenter, which outperforms the best SIGHAN 2003 word segmenters in the closed track on 3 out of 4 test corpora.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Stacked Sub-Word Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging

The large combined search space of joint word segmentation and Part-of-Speech (POS) tagging makes efficient decoding very hard. As a result, effective high order features representing rich contexts are inconvenient to use. In this work, we propose a novel stacked subword model for this task, concerning both efficiency and effectiveness. Our solution is a two step process. First, one word-based ...

متن کامل

CRF-based Hybrid Model for Word Segmentation, NER and even POS Tagging

This paper presents systems submitted to the close track of Fourth SIGHAN Bakeoff. We built up three systems based on Conditional Random Field for Chinese Word Segmentation, Named Entity Recognition and Part-Of-Speech Tagging respectively. Our systems employed basic features as well as a large number of linguistic features. For segmentation task, we adjusted the BIO tags according to confidence...

متن کامل

A Cascaded Linear Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging

We propose a cascaded linear model for joint Chinese word segmentation and partof-speech tagging. With a character-based perceptron as the core, combined with realvalued features such as language models, the cascaded model is able to efficiently utilize knowledge sources that are inconvenient to incorporate into the perceptron directly. Experiments show that the cascaded model achieves improved...

متن کامل

Character-Level Dependency Model for Joint Word Segmentation, POS Tagging, and Dependency Parsing in Chinese

Recent work on joint word segmentation, POS (Part Of Speech) tagging, and dependency parsing in Chinese has two key problems: the first is that word segmentation based on character and dependency parsing based on word were not combined well in the transition-based framework, and the second is that the joint model suffers from the insufficiency of annotated corpus. In order to resolve the first ...

متن کامل

برچسب‌گذاری ادات سخن زبان فارسی با استفاده از مدل شبکۀ فازی

Part of speech tagging (POS tagging) is an ongoing research in natural language processing (NLP) applications. The process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging, POS-tagging, or simply tagging. Parts of speech are also known as word classes or lexical categories. The purpose of POS tagging is determining the grammatical ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2004